processing units (GPUs). So, this study proposes a new joint learning (JL) 
approach to classify human activities using inertial sensors. To this end, a 
large complex donor model based on a convolutional neural network (CNN)
is used to transfer knowledge to a smaller CNN-based model referred to as
the acceptor model. The acceptor model can be deployed on mobile devices
and low-power hardware due to decreased computing costs and memory
consumption. The wireless sensor data mining (WISDM) dataset is used to
test the proposed model. According to the experimental results, the HAR
system based on the JL algorithm outperforms other methods.

Keywords: Convolutional neural network; Deep neural network; Graphics processing unit;
Human activity recognition; Joint learning
1. INTRODUCTION 

Inertial sensors such as gyroscopes, accelerometers, and magnetometers are widely used in various 
fields, including human activities and patient monitoring [1], [2], sports activities [3], and Persian character 
recognition due to the development of their manufacturing technology, cheapness, portability, wearability, and 
support for various communication protocols, e.g., Wi-Fi and Bluetooth. These sensors have advantages in the 
field of human activities, including the possibility of attaching them to certain parts of the body and easily 
collecting valuable data from people's daily activities such as sitting, walking, jumping, and cycling. 
Accelerometers and gyroscopes are the most common sensors embedded in smartphones. Accelerometers are 
electronic sensors that measure the acceleration applied to the sensor in the x, y, and z axes, while gyroscopes 
are devices for measuring angular velocity or maintaining rotational motion. Once raw signals are collected
from the inertial sensors, they are first processed to remove noise and are then recognized and classified.

Recently, there have been many studies on the classification of data collected from inertial sensors. 
Machine learning algorithms such as the multilayer perceptron, support vector machine, and decision tree have
achieved successful results. However, manual feature extraction for these algorithms requires specialized
knowledge and is sometimes not possible due to the large volume of inputs. Meanwhile, the emergence of powerful
deep learning tools, e.g., the convolutional neural network (CNN) [4] and long short-term memory
(LSTM) [5]-[7], and their success in various fields in the past decade have made it possible to extract
features from large volumes of data. Researchers can thus extract, automatically and layer by layer, high-level
features and complex patterns that were previously difficult or impossible to extract, without resorting to
traditional methods and manual feature extraction. Accordingly, effective pre-processing and feature selection
can be achieved within the defined framework by using deep learning algorithms and a large amount of data.

Nowadays, deep learning models have received more attention from researchers than traditional
models and have contributed significantly to improving advanced functions. However, they have
many parameters due to their complexity and large size. Such complex models are therefore trained on powerful
graphics processing units (GPUs) and tensor processing units (TPUs), and they require high
computational costs and large storage spaces. Consequently, deploying massive deep learning models on
low-power hardware devices such as smartphones, robots, and embedded chips is a challenging issue that has drawn
researchers' attention to methods for reducing the computational costs and number of parameters of
deep learning models without reducing their speed and accuracy. These methods include
replacing computationally intensive operations, pruning, quantizing model parameters to reduce
memory requirements and increase inference speed, using hash methods, and developing faster and more
lightweight architectures for inference [8]-[10].

There are many studies on human activity recognition (HAR) in which data collected
by inertial sensors are analyzed with deep learning techniques. For example, Yen et al. [11] used a CNN to
recognize 6 daily human activities. Xu et al. [12] applied a HAR method based on an improved 1D-CNN to 11
human activities. Hsu et al. [3] developed a wearable sport activity classification system and the associated
deep learning-based sport activity classification algorithm for accurately recognizing sports activities. Many
studies have been conducted on monitoring elderly people's activities. For example, Shang et al. [13]
investigated the monitoring of elderly people's bathing activities through data collection by wrist-mounted
accelerometers and used a CNN model for classification. According to the calculations, the CNN
layers require 343.94 MB of memory.

Santoyo-Ramon et al. [14] investigated the effects of sensor sampling frequency on fall detection
during daily activities based on raw acceleration signals captured by wearable sensors and classified the data
using a CNN. Lawal and Bano [4] designed an integrated two-stream CNN to predict human
activities and tested datasets from different body parts in order to identify the best sensor positions. According
to the calculations, the two-stream CNN layers consumed 9.047 MB of memory. Ignatov [15] applied a
user-independent deep learning-based approach in order to classify human activities online. That study
proposed using a CNN to extract local features together with simple statistical features that preserve
information about the global form of the time series.

Various studies have combined different models, e.g., CNN with LSTM or hidden Markov models,
in order to improve the performance of these classifiers. For example, Wang et al. [5] introduced a hybrid
LSTM_CNN algorithm for the HAR. This algorithm handles time-dependent data with multiple features. To
classify human activities, Khatun et al. [16] proposed a hybrid deep learning model that combines the CNN 
and CNN-LSTM. In a study by Sepahvand and Abdali-Mohammadi [9], the adversarial vulnerability of gait 
event detection through the CNN and LSTM models was investigated. Serrao et al. [17] applied three deep 
learning architectures, i.e., CNN, LSTM, and a gated recurrent unit (GRU) network, in order to classify 17 
classes of the UniMiB SHAR dataset. Their applied CNN requires 1.74 MB of memory. In a study by Gupta 
[18], a hybrid GRU-CNN deep learning model was investigated for the accurate classification of 18 simple and 
complex human daily activities from the wireless sensor data mining (WISDM) dataset. 

Researchers have used a combination of different algorithms and models in order to improve deep 
learning models. For example, Raziani and Azimbagirad [19] used a 1-D CNN model to classify human 
activities and investigated seven approaches based on the grey wolf optimizer (GWO), whale optimization 
algorithm (WOA), salp swarm algorithm (SSA), sine cosine algorithm (SCA), multiverse optimizer (MVO), 
particle swarm optimization (PSO), and moth flame optimization (MFO) for automatic selection of optimal 
meta-parameters of the CNN model. Although these methods perform well in the classification of
activities, their high number of parameters and computational complexity have
limited their use in edge devices that have weak hardware.

In their study on the LSTM-ConvAE architecture for the HAR, Thakur et al. [20] suggested that
CNNs, autoencoders (AEs), and LSTMs have complementary modeling capabilities because CNNs are
adept at automatic feature extraction, AEs at dimensionality reduction, and LSTMs at temporal modeling. In a
study by Shang et al. [13], the CNN and the arithmetic optimization algorithm (AOA) were used on three different
general datasets, i.e., HAR-UCI, HAR-WISDM, and HAR-KU. According to the calculations, the CNN
parameters require 14.224 MB of memory. Jaberi and Ravanmehr [2] proposed the use of the CNN model and
an optimization technique using the ternary weight nets (TWN) model in order to reduce the complexity of the
deep neural network approach and, as a result, decrease the energy consumption of mobile devices while
increasing the accuracy of the HAR.

Sepahvand et al. [10] proposed an adaptive real-time HAR monitoring system in order to classify
more human motions in dynamic situations using 4 machine learning algorithms, i.e., k-nearest neighbors
(KNN), decision trees (DTs), artificial neural network (ANN), and support vector machines (SVMs), and a
deep learning algorithm, the LSTM. The results indicated that classification by the LSTM had a higher
accuracy. Muralidharan et al. [21] applied classic machine learning algorithms such as logistic regression,
linear and kernel SVM, decision tree, and random forest in order to classify human activities from the obtained data.
They also applied a deep feedforward neural network and a 1D-CNN and compared them with the single 
classifier algorithms. Gil-Martin et al. [22] proposed a HAR system consisting of three modules, the first of 
which divided acceleration signals into overlapping windows and extracted the data from each window in the 
frequency domain. The second module recognized the activities performed in each window using a CNN 
model. The calculations performed on the CNN parameters required 77.06 MB of memory [23]-[25]. 

In this study, a joint learning (JL) approach is used to recognize 6 human
activities: walking, jogging, going upstairs, going downstairs, sitting, and standing. In this process,
a CNN-based model with a complex and heavy architecture, called the donor, which can extract important
features from a large amount of data and make accurate predictions, is trained first. The donor model then
trains a CNN-based model with a small and light architecture, called the acceptor, through JL. In this way, the
knowledge in the larger model is distilled into a smaller and faster model that can be
implemented on mobile devices, so that the acceptor model achieves an accuracy close to or higher than
that of the donor model.

The main goals of the study include: 

— Reducing the parameters of the acceptor model in order to decrease the computing costs and memory 
consumption for deploying the acceptor model on the mobile device; and 
— Proving that the proposed method brings about the best results on the WISDM dataset. 


2. THE PROPOSED METHOD 
2.1. General structure 

As shown in Figure 1, the JL model consists of 2 modules: the donor network and the acceptor
network. The donor network has a structure different from that of the acceptor network. The main goal is to provide
a vanilla distillation framework, in the sense that a large pre-trained donor network transfers its knowledge to
the acceptor network. The signals collected by the three-axis accelerometer of the
smartphone in the WISDM laboratory are first applied as inputs to the JL model. The softmax of the donor
network is then softened with a temperature value (T); its output is the set of class probabilities produced
by the donor network and serves as the soft labels. The temperature value depends on the input data and is obtained
experimentally. The same temperature is used for the softmax of the acceptor, the output of which is called the
soft prediction. The softmax activation function is used in the output layer to normalize the network
output and convert it into a probability distribution, so that the probabilities of all the classes sum to 1.
Typically, the probability of the predicted class is the largest, while the probabilities of the other classes
are close to 0. The cross-entropy loss function is then calculated for the soft and hard labels. Finally, the acceptor
network is trained by minimizing the final loss function obtained from the soft and hard label loss functions.
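
The effect of the temperature on the softmax output can be illustrated with a minimal sketch. It assumes the usual temperature-scaled softmax, in which each logit is divided by T before normalization; the function name, the example logits, and the temperature values are hypothetical and are not taken from the experiments.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into class probabilities softened by a temperature T."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the 6 activity classes (illustrative values only).
logits = [4.0, 1.0, 0.5, 0.2, -1.0, -2.0]
hard = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=5.0)
```

With the higher temperature, the largest probability shrinks and the remaining classes receive more mass, which is the additional information the acceptor can learn from.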


2.2. Learning transfer method 

The method of dark knowledge transfer in vanilla JL is described as follows: z denotes the output logit vector
of the last fully connected layer of the acceptor network, k is the number of target classes, and T represents a
temperature coefficient. The higher its value, the closer the distribution p is to uniform. Also, p is the
probability of the output of the acceptor network for the i-th class, which is calculated in (1). In (2),
zo denotes the output logit vector of the last fully connected layer of the donor network, and q is the
probability of the output of the donor network for the i-th class. Here, X is the input data, and y is a
k-dimensional vector derived using the one-hot encoding method. The
cross-entropy loss functions used for the soft and hard labels are calculated as (3) and (4). The acceptor network
can be trained by minimizing the loss in (5), where α denotes the weighting factor used for the loss based
on the soft and hard labels, and R is the number of input samples X.
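
The loss described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes that the soft loss is the cross-entropy between the donor probabilities q and the acceptor probabilities p at temperature T, that the hard loss is the cross-entropy between the one-hot vector y and the acceptor output at a temperature of 1, and that the two are combined as α times the soft loss plus (1 - α) times the hard loss, averaged over the R samples. The function names and default values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one logit vector."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def joint_loss(acceptor_logits, donor_logits, one_hot_labels,
               temperature=4.0, alpha=0.5):
    """Mean over R samples of alpha * soft loss + (1 - alpha) * hard loss.

    The weighting convention and the default values are assumptions.
    """
    total = 0.0
    for z, z_donor, y in zip(acceptor_logits, donor_logits, one_hot_labels):
        q = softmax(z_donor, temperature)  # donor soft labels
        p_soft = softmax(z, temperature)   # acceptor soft prediction
        p_hard = softmax(z, 1.0)           # acceptor prediction at T = 1
        soft_loss = cross_entropy(q, p_soft)
        hard_loss = cross_entropy(y, p_hard)
        total += alpha * soft_loss + (1.0 - alpha) * hard_loss
    return total / len(acceptor_logits)
```

An acceptor whose logits agree with both the donor and the true label yields a smaller loss than one that disagrees with them, which is the signal used to train the acceptor.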
Figure 1. Joint deep learning architecture 


3. RESULTS AND DISCUSSION 

The results of the study are obtained from the performance of the donor and acceptor models, both based
on a 1-D CNN, with the JL algorithm used to increase the accuracy of the acceptor model. The Adam optimizer
is used to update the network parameters. The loss value is calculated using the cross-entropy function to predict
the HAR. It is worth noting that the test is performed on a PC with Microsoft Windows 10, an Intel(R) Core(TM)
i5-11400H processor, 16 GB RAM, and an Nvidia GeForce RTX 3050 GPU. The evaluation criteria used for the
proposed HAR algorithm in the test phase include accuracy, recall, precision, and F1-score. The main
parameters true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are used to
calculate the accuracy, recall, precision, and F1-score criteria.


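\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{7} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{8} \]

\[ \text{F1\_Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9} \]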
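
As an illustration of (6)-(9), the following sketch computes the four criteria for one class from a confusion matrix, treating that class as positive and all other classes as negative. The function name and the 3-class confusion matrix are hypothetical and are not results from this study.

```python
def per_class_metrics(confusion, cls):
    """Compute accuracy, precision, recall, and F1-score for one class.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                        # missed samples of cls
    fp = sum(confusion[r][cls] for r in range(n)) - tp   # wrongly assigned to cls
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / total
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical confusion matrix for 3 classes (illustrative values only).
confusion = [[8, 2, 0],
             [1, 9, 0],
             [0, 1, 9]]
metrics = per_class_metrics(confusion, 0)
```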
3.1. Dataset 
The WISDM dataset consists of 1,098,203 samples collected from smartphone accelerometer sensors
in a laboratory environment and includes 6 different human activities, i.e., "standing", "walking", "sitting",
"jogging", "going downstairs", and "going upstairs". The features of this dataset are the three-axis linear
acceleration values obtained at a fixed sampling rate of 20 Hz. The dataset is divided into 2 groups: 70% for
training and 30% for testing.
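
A 1-D CNN operates on fixed-length segments of the signal, so the sample stream has to be cut into windows before the 70/30 split. The sketch below shows one way this could be done. The window length of 200 samples (10 s at 20 Hz), the 50% overlap, the majority-vote labelling, and the random split are all assumptions made for illustration; the segmentation settings of this study are not stated in the text, and the data here are synthetic.

```python
import random
from collections import Counter


def make_windows(samples, labels, window=200, step=100):
    """Cut a stream of (x, y, z) samples into fixed-length labelled windows.

    Each window takes the most frequent activity label inside it.
    """
    windows = []
    for start in range(0, len(samples) - window + 1, step):
        segment = samples[start:start + window]
        majority = Counter(labels[start:start + window]).most_common(1)[0][0]
        windows.append((segment, majority))
    return windows


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle the items and split them into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Synthetic stand-in for the accelerometer stream (not WISDM data).
samples = [(0.0, 9.8, 0.0)] * 1000 + [(1.5, 8.0, 2.0)] * 1000
labels = ["sitting"] * 1000 + ["walking"] * 1000
windows = make_windows(samples, labels)
train, test = split_train_test(windows)
```

Note that overlapping windows shuffled at random can place nearly identical segments in both parts; splitting by subject or by time avoids this, at the cost of a less even class balance.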


3.2. Baseline 
The proposed method is compared with the following studies: 

— CNN [26]: In this method, CNN is used to recognize and classify human activities. 

— CNN+Handcrafted features [2]: In this study, human activities are recognized using a shallow CNN 
model to extract local features along with statistical features. 

— LSTM_CNN [27]: In this study, the combination of LSTM and CNN models is used to improve the HAR 
performance. 

— ConvAE_LSTM [20]: In this study, ConvAE_LSTM architecture is proposed, in which features are 
automatically extracted through CNN, AE is used for dimensionality reduction, and LSTM is used for 
temporal modeling. 


3.3. Evaluation of donor model, acceptor model, and JL 

In this test, the donor model with 124,126 parameters and a memory consumption of ~993 KB is trained
on the 6 human activities. Furthermore, the compressed acceptor model with 7,964 parameters and ~63 KB
memory consumption is trained by the proposed JL method. The classification results of the
donor model are listed in Table 1, where the F1-score is 91%, 90%, 97%, and 97% for going downstairs, going
upstairs, sitting, and standing, respectively, and 99% for jogging and walking. The
classification results of the acceptor model are reported in Table 2, where the maximum F1-score is 98%
for walking and the minimum F1-score is 81% for going upstairs.
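
The reported memory figures can be related to the parameter counts with a short calculation. The sketch assumes 8 bytes per parameter and 1 KB = 1000 bytes; these values are not stated in the text but reproduce the reported ~993 KB and ~63 KB. The function name is hypothetical.

```python
def model_memory_kb(num_parameters, bytes_per_parameter=8):
    """Estimate parameter memory in KB (1 KB = 1000 bytes, an assumption)."""
    return num_parameters * bytes_per_parameter / 1000.0


donor_params = 124_126
acceptor_params = 7_964

donor_kb = model_memory_kb(donor_params)        # about 993 KB
acceptor_kb = model_memory_kb(acceptor_params)  # about 63.7 KB
compression_ratio = donor_params / acceptor_params  # about 15.6
```

Under these assumptions, the acceptor has roughly 15.6 times fewer parameters than the donor.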

The confusion matrix for the 6 activities is shown in Figure 2. Jogging and walking are well classified
because larger changes in the accelerometer values are observed when performing these activities.
Figure 2(a) shows the confusion matrix for the acceptor model, while Figure 2(b) shows the confusion matrix
for the donor model.


Table 1. The performance of the classification by the donor CNN model 


Activity       Precision   Recall   F1-score
Downstairs     0.89        0.93     0.91
Jogging        1.00        0.99     0.99
Sitting        0.99        0.94     0.97
Standing       0.94        0.99     0.97
Upstairs       0.93        0.87     0.90
Walking        0.98        0.99     0.99
Accuracy                            0.97
Macro avg      0.95        0.95     0.95
Weighted avg   0.97        0.97     0.97


Table 2. The performance of the classification by the acceptor CNN model 


Activity       Precision   Recall   F1-score
Downstairs     0.74        0.91     0.82
Jogging        0.99        0.95     0.97
Sitting        0.99        0.94     0.97
Standing       0.93        0.99     0.96
Upstairs       0.84        0.78     0.81
Walking        0.98        0.98     0.98
Accuracy                            0.94
Macro avg      0.91        0.93     0.92
Weighted avg   0.94        0.94     0.94


Figures 3(a) and 3(b) show the accuracy of the donor model and of the acceptor model trained by the JL method,
respectively. The accuracy on the test set is more than 97% for the donor model and more than 95% for the
trained acceptor model. Figures 3(c) and 3(d) show the loss values of the donor and acceptor
models, which decrease gradually during training.

Figure 2. The confusion matrix for the 6 human activities: (a) the acceptor CNN model and (b) the donor CNN
model


3.4. Comparing the proposed method with previous studies 

Table 3 compares the accuracy of the proposed method with the accuracies reported in previous studies on
human activities using inertial sensors. The overall accuracy of the donor CNN model is
97.40%, which is 2.61 percentage points higher than that of the CNN classification [26] and 1.55 percentage
points higher than that of the combined LSTM+CNN method [27]. Compared to CNN+Handcrafted features [2],
the accuracy increases by 6.98 percentage points. However, the accuracy of the combined ConvAE_LSTM method [20]
is 1.47 percentage points higher than that of the proposed method. Through JL, the proposed method improves
the accuracy of a CNN model with a small number of parameters (from 94% for CNN_Acceptor to 95% for
JL_Acceptor in Table 3), which makes the model suitable for deployment on low-power hardware devices.

Figure 3. Learning curves during the training by the JL algorithm: (a) accuracy plot related to the donor's
network, (b) accuracy plot related to the acceptor's network, (c) loss plot related to the donor's network, and
(d) loss plot related to the acceptor's network


Table 3. Comparison of the proposed method with previous studies 


Methods                         Accuracy
CNN [26]                        94.79%
CNN+Handcrafted features [2]    95.75%
LSTM+CNN [27]                   95.85%
ConvAE_LSTM [20]                98.14%
CNN_Donor                       96.90% ± 0.53%
CNN_Acceptor                    94% ± 0.2%
JL_Acceptor                     95% ± 0.62%
4. CONCLUSION 

This study proposes a solution to the problem of deploying deep learning models that can solve
complex problems on smartphones by using a more complex donor model and a less complex acceptor model,
both based on the CNN. The performance of the proposed method is tested using a dataset on human
activities. By training on 6 different human activities, the donor model extracts the important and complex
features of the data and achieves an accuracy of 97.40%. According to the experimental
results, the proposed donor model achieves higher accuracy than most of the compared methods. Furthermore, the
acceptor model reaches an accuracy of 95.62% after knowledge is transferred to it from the donor model.
The proposed method allows the use of a CNN-based model with less computational complexity and a high
accuracy on mobile devices. One of our main ideas to improve the proposed method in the future is to use a
meta-learning approach.